pacman::p_load(tidyverse, lubridate, plotly, treemap, ggstatsplot)
rawData <- read_csv("resale-flat-prices-based-on-registration-date-from-jan-2017-onwards.csv")Take-Home Exercise 3
The following is my submission for Take-Home Exercise 3.
1. Task
In this take-home exercise, we are required to uncover the salient patterns of the resale prices of public housing property by residential towns and estates in Singapore by using appropriate analytical visualization techniques learned in Lesson 4: Fundamentals of Visual Analytics. We are encouraged to apply appropriate interactive techniques to enhance user and data discovery experiences.
For the purpose of this study, the focus would be on 3-ROOM, 4-ROOM and 5-ROOM types. We can choose to focus on either one housing type or multiple housing types. In my case, I will be focusing on multiple housing types for the year 2022.
2. Dataset
2.1 Data Source
The dataset used in this take-home exercise was downloaded from the following link: https://data.gov.sg/dataset/resale-flat-prices
We are using the Resale flat prices based on registration date from Jan-2017 onwards csv file for this exercise.

2.2 Data Preparation
First, let’s load the dataset and the libraries that we will be using:
Let’s take a look at this dataset:
head(rawData)# A tibble: 6 × 11
month town flat_…¹ block stree…² store…³ floor…⁴ flat_…⁵ lease…⁶ remai…⁷
<chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <dbl> <chr>
1 2017-01 ANG MO … 2 ROOM 406 ANG MO… 10 TO … 44 Improv… 1979 61 yea…
2 2017-01 ANG MO … 3 ROOM 108 ANG MO… 01 TO … 67 New Ge… 1978 60 yea…
3 2017-01 ANG MO … 3 ROOM 602 ANG MO… 01 TO … 67 New Ge… 1980 62 yea…
4 2017-01 ANG MO … 3 ROOM 465 ANG MO… 04 TO … 68 New Ge… 1980 62 yea…
5 2017-01 ANG MO … 3 ROOM 601 ANG MO… 01 TO … 67 New Ge… 1980 62 yea…
6 2017-01 ANG MO … 3 ROOM 150 ANG MO… 01 TO … 68 New Ge… 1981 63 yea…
# … with 1 more variable: resale_price <dbl>, and abbreviated variable names
# ¹flat_type, ²street_name, ³storey_range, ⁴floor_area_sqm, ⁵flat_model,
# ⁶lease_commence_date, ⁷remaining_lease
summary(rawData) month town flat_type block
Length:146872 Length:146872 Length:146872 Length:146872
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
street_name storey_range floor_area_sqm flat_model
Length:146872 Length:146872 Min. : 31.0 Length:146872
Class :character Class :character 1st Qu.: 82.0 Class :character
Mode :character Mode :character Median : 94.0 Mode :character
Mean : 97.6
3rd Qu.:113.0
Max. :249.0
lease_commence_date remaining_lease resale_price
Min. :1966 Length:146872 Min. : 140000
1st Qu.:1985 Class :character 1st Qu.: 358000
Median :1996 Mode :character Median : 448000
Mean :1996 Mean : 478316
3rd Qu.:2007 3rd Qu.: 565800
Max. :2019 Max. :1418000
As you can see, the month column is in the format of YYYY-MM. Using the month column, we can actually create two new columns - Date (in YYYY-MM-DD format) and Year. These fields will be useful when building visualizations.
data <- rawData %>%
mutate(Year = year(ym(month))) %>%
mutate(Date = as.Date(paste(month, "-01", sep = ""),
format = "%Y - %m - %d"))
summary(data) month town flat_type block
Length:146872 Length:146872 Length:146872 Length:146872
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
street_name storey_range floor_area_sqm flat_model
Length:146872 Length:146872 Min. : 31.0 Length:146872
Class :character Class :character 1st Qu.: 82.0 Class :character
Mode :character Mode :character Median : 94.0 Mode :character
Mean : 97.6
3rd Qu.:113.0
Max. :249.0
lease_commence_date remaining_lease resale_price Year
Min. :1966 Length:146872 Min. : 140000 Min. :2017
1st Qu.:1985 Class :character 1st Qu.: 358000 1st Qu.:2018
Median :1996 Mode :character Median : 448000 Median :2020
Mean :1996 Mean : 478316 Mean :2020
3rd Qu.:2007 3rd Qu.: 565800 3rd Qu.:2021
Max. :2019 Max. :1418000 Max. :2023
Date
Min. :2017-01-01
1st Qu.:2018-09-01
Median :2020-07-01
Mean :2020-03-27
3rd Qu.:2021-10-01
Max. :2023-02-01
Since we will only be focusing on the year 2022, we will filter the dataset to only keep records for the year 2022. Also, we only want to work with records which have a flat type of 3-Room, 4-Room or 5-Room. Let’s filter the data to meet this condition.
resaleFlats <- data %>%
filter(Year == "2022") %>%
filter(flat_type %in% c("3 ROOM", "4 ROOM", "5 ROOM"))
head(resaleFlats)# A tibble: 6 × 13
month town flat_…¹ block stree…² store…³ floor…⁴ flat_…⁵ lease…⁶ remai…⁷
<chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <dbl> <chr>
1 2022-01 ANG MO … 3 ROOM 320 ANG MO… 07 TO … 73 New Ge… 1977 54 yea…
2 2022-01 ANG MO … 3 ROOM 225 ANG MO… 07 TO … 67 New Ge… 1978 55 yea…
3 2022-01 ANG MO … 3 ROOM 331 ANG MO… 07 TO … 68 New Ge… 1981 58 yea…
4 2022-01 ANG MO … 3 ROOM 534 ANG MO… 07 TO … 82 New Ge… 1980 57 yea…
5 2022-01 ANG MO … 3 ROOM 578 ANG MO… 04 TO … 67 New Ge… 1980 57 yea…
6 2022-01 ANG MO … 3 ROOM 452 ANG MO… 01 TO … 83 New Ge… 1979 56 yea…
# … with 3 more variables: resale_price <dbl>, Year <dbl>, Date <date>, and
# abbreviated variable names ¹flat_type, ²street_name, ³storey_range,
# ⁴floor_area_sqm, ⁵flat_model, ⁶lease_commence_date, ⁷remaining_lease
The data is now ready for visualization.
3. Data Exploration & Visualizations
Let’s first explore the fields we have in our dataset, such as the unique values we have in our key categorical variables - Town, Flat Types and Flat Models.
unique(resaleFlats$town) [1] "ANG MO KIO" "BEDOK" "BISHAN" "BUKIT BATOK"
[5] "BUKIT MERAH" "BUKIT PANJANG" "BUKIT TIMAH" "CENTRAL AREA"
[9] "CHOA CHU KANG" "CLEMENTI" "GEYLANG" "HOUGANG"
[13] "JURONG EAST" "JURONG WEST" "KALLANG/WHAMPOA" "MARINE PARADE"
[17] "PASIR RIS" "PUNGGOL" "QUEENSTOWN" "SEMBAWANG"
[21] "SENGKANG" "SERANGOON" "TAMPINES" "TOA PAYOH"
[25] "WOODLANDS" "YISHUN"
unique(resaleFlats$flat_type)[1] "3 ROOM" "4 ROOM" "5 ROOM"
unique(resaleFlats$flat_model) [1] "New Generation" "Improved" "Model A"
[4] "DBSS" "Simplified" "Premium Apartment"
[7] "Standard" "Model A-Maisonette" "Model A2"
[10] "Type S1" "Type S2" "Premium Apartment Loft"
[13] "Adjoined flat" "Terrace" "3Gen"
[16] "Improved-Maisonette"
For this exercise, we will focus on the following areas:
- Resale Flats by Flat Models
- Resale Flats by Flat Types
- Resale Prices by Flat Models
- Resale Prices by Flat Types
- Resale Prices vs Floor Area
- Resale Prices vs Floor Area by Month
3.1 Resale Flats by Flat Models
The code chunk below aims to explore the number of resale flats by flat models.
Code
fig3 <- ggplot(data = resaleFlats,
aes(x = flat_type,
fill = flat_type,
colour = flat_type)) +
geom_bar(width = 0.5) +
facet_wrap(~ flat_model) +
labs(title = "Trend of Flat Models by Flat Types in Singapore Towns, 2022") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))
ggplotly(fig3)Based on the figure above, it is clear that certain flat models are more common that the others, and these common flat models are either Improved, Model A or New Generation. Another reason as to why the other flat models do not seem to have much data could be because we already filtered the dataset to only focus on 3 Room, 4 Room and 5 Room flats.
3.2 Resale Flats by Flat Types
The following code chunk aims to explore the number of resale flats by flat type for the various towns in Singapore.
Code
fig1 <- ggplot(data = resaleFlats,
aes(x = flat_type,
fill = flat_type,
colour = flat_type)) +
geom_bar(width = 0.5) +
facet_wrap(~ town) +
labs(title = "Trend of Flat Types in Singapore Towns, 2022") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))
ggplotly(fig1)Based on the interactive plot above, we can deduce that Jurong East, Jurong West and Serangoon do not seem to have many resale flats in their towns. On the other hand, Punggol, Sengkang, Woodlands and Yishun seems to have a higher number of resale flats in their towns.
Let’s narrow down our data exploration by grouping the towns based on their average resale prices. The code chunk below groups the dataset by date, towns and their flat types and computes the average resale price for this grouping.
Code
groupedFlats <- resaleFlats %>%
group_by(Date, town, flat_type) %>%
summarise(avgPrice = mean(resale_price)) %>%
ungroup()
groupedTowns <- resaleFlats %>%
group_by(Date, town, flat_type) %>%
summarise(avgPrice = mean(resale_price))Another way to visualize the number of resale flats by flat type would be to plot a treemap. This treemap will allow us to gain quick insights into which town has a higher contribution to each flat type.
Code
level1 <- groupedTowns$flat_type
level2 <- groupedTowns$town
avgPrice <- groupedTowns$avgPrice
data <- data.frame(level1, level2, avgPrice)
treemap(data,
index = c("level1", "level2"),
vSize = c("avgPrice"),
type = "index",
palette = "Set2")
As seen in the treemap above, Central Area seems to have one of the highest average prices of resale flats across the different flat types in Singapore. Other towns which have a high average price of resale flats across the different flat types include Queenstown, Bishan and Bukit Merah.
3.3 Resale Prices vs Flat Models
One way to compare the resale prices by flat models would be to plot a violin plot to view the distribution as well as the mean price of the resale flats. The code chunk below intends to do that. We are focusing on the flat models of Improved, Model A and New Generation as these are the models which had a higher number of resale flats as per our exploration in previous steps.
Code
flatModels <- resaleFlats %>%
filter(flat_model %in% c("Improved", "Model A", "New Generation"))
fig6 <- ggbetweenstats(
data = flatModels,
x = "flat_model",
y = "resale_price",
type = "np") +
ggtitle("Non-Parametric Analysis of Resale Prices by Flat Model")
ggplotly(fig6)From the above violin plot, we can observe that the mean of resale prices for Improved flats slightly higher that mean of resale price of Model A. On the other hand, the mean of resale price of New Generation flats are significantly lower.
3.4 Resale Prices by Flat Types
The following code chunk plots the average resale prices by flat types across 2022 for each town:
Code
fig2 <- ggplot(groupedTowns,
aes(x = Date,
y = avgPrice)) +
geom_line(aes(colour = flat_type)) +
facet_wrap(~ town) +
labs(title = "Trend of Resale Prices by Flat Types in Singapore Towns, 2022") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))
ggplotly(fig2)As seen in the interactive plot above, the average resale prices in towns such as Central Area, Bishan, Clementi and Toa Payoh seem to be very high as when compared to towns such as Choa Chu Kang, Jurong West and Sembawang.
Another way to visualize the resale prices by flat types would be plot a violin plot.
Code
fig7 <- ggbetweenstats(
data = resaleFlats,
x = "flat_type",
y = "resale_price",
type = "np") +
ggtitle("Non-Parametric Analysis of Resale Prices by Flat Type")
ggplotly(fig7)3.5 Resale Prices vs Floor Area
Let’s use the code chunk below to find out if there is a correlation between the resale price and the floor area:
Code
fig4 <- ggscatterstats(
data = resaleFlats,
x = floor_area_sqm,
y = resale_price,
marginal = FALSE,
point.args = list(size = 0.5,
stroke = 0,
colour = "grey"),
smooth.line.args = list(linewidth = 0.5,
colour = "black",
method = "lm")) +
labs(title = "Correlation between Resale Prices and Floor Area, 2022") +
facet_wrap(~ town)
ggplotly(fig4)This scatterplot helps to depict the relationship between the two numerical variables resale_price and floor_area_sqm. So how do we read this plot? The angles of the lines depict the relationship between the two variables, and this could be either a positive or negative relationship. At once glance, we can tell that there is a positive correlation between these two variables. However, when looking at towns such as Central Area and Toa Payoh, we can note that the relationship is not as linear as the relationship in other towns.
3.6 Resale Prices vs Floor Area by Month
Let’s use the code chunk below to find out the trend of the resale prices by floor area across the months. We will narrow down our analysis to only focus on 5 Room flats for this visualization.
Code
improvedModels <- resaleFlats %>%
filter(flat_model == "Improved") %>%
filter(flat_type == "5 ROOM")
heatmap <- ggplot(improvedModels,
aes(x = Date,
y = floor_area_sqm,
fill = resale_price)) +
geom_tile() +
labs(title = "5 Room Flat's Resale Prices by Floor Area in 2022")
ggplotly(heatmap)From this heatmap, we will be unable to predict the resale price of a 5 room flat for the next month as you can see that there is an inconsistency in the resale prices when compared against the floor area. This could be something that further explore in the future as there could be other factors that influence the resale prices.
This concludes the work submitted for Take-Home Exercise 3. Thank you!